IMSE: interaction information attention and molecular structure based drug drug interaction extraction

Background: Extraction of drug-drug interactions (DDIs) from biomedical literature and other textual data is an important component of drug-safety monitoring and has attracted the attention of many researchers in healthcare. Existing works pivot mainly on relation extraction using bidirectional long short-term memory networks (BiLSTM) and BERT models, which do not attain the best feature representations. Results: Our proposed DDI (drug-drug interaction) prediction model provides multiple advantages: (1) a newly proposed attention vector better handles the problem of overlapping relations; (2) the molecular structure information of drugs is integrated into the model to better express the functional-group structure of drugs; (3) text features combining the T-distribution and Chi-square distribution make the model focus more closely on drug entities; and (4) the model achieves similar or better prediction performance (F-scores up to 85.16%) than state-of-the-art DDI models on benchmark datasets. Conclusions: Our model, which leverages a state-of-the-art transformer architecture in conjunction with multiple features, can bolster the performance of drug-drug interaction extraction tasks in the biomedical domain. In particular, we believe our research will be helpful in the identification of potential adverse drug reactions.

Traditionally, doctors have obtained the latest information on DDIs from two main sources: reading numerous biomedical papers, or querying DDIs from biomedical databases. The volume of biomedical literature has grown rapidly, so reading a large number of papers is clearly inefficient. Querying biomedical databases seems feasible, but given the quantity of biomedical literature, manually updating and revising a professional database requires substantial resources. Neither of these methods is therefore ideal for obtaining DDIs.
The DDI extraction task [6] aims to extract DDIs from free text in the biomedical field. The DDIExtraction 2013 task seeks to classify each DDI candidate into one of five types ("Advise", "Effect", "Mechanism", "Int" and "False").
In the early days, researchers often adopted pattern-based methods and feature-based machine learning methods [7, 8]. However, pattern-based methods require annotators with domain knowledge, and their main drawback is that they are both time-consuming and inefficient.
In recent years, methods based on pre-training [29, 30] have achieved good results. Lee et al. introduced BioBERT (bidirectional encoder representations from transformers for biomedical text mining) to improve DDI extraction [31]; the authors pre-trained BioBERT on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). Boukkouri et al. put forward a new variant of BERT [32] that completely abandons the wordpiece system and instead uses a character-CNN module to represent whole words from their characters. Recently, Sun et al. [33] further improved extraction on the DDIExtraction 2013 task by introducing a Gaussian vector and other external knowledge on top of BioBERT.
The above-mentioned solutions come with some drawbacks. First, as shown in Fig. 1, a sentence may contain multiple complex drug-drug interactions. For example, the drug 'alosetron' interacts with three other drugs ('isoniazid', 'procainamide', 'hydralazine'). We call this relationship overlap, and the above methods do not handle this kind of complex overlap well. (Fig. 1: An example of overlapping relationships. Drug entities are labeled; the example shows multiple drug entities and multiple interactions in a single sentence.) The newly proposed attention vector is designed to better deal with the problem of overlapping relationships. Second, a plethora of drug feature information is available for many approved drugs, including molecular structure, drug SMILES, and more. All of the above methods ignore this additional but very useful information. Therefore, we obtained the SMILES molecular formulas of the drugs from the DrugBank [34] database, and transformed the SMILES formulas into fingerprints and adjacency matrices with the RDKit toolkit [35] to obtain drug molecular features. In addition, the positions of drug entities in the text are very helpful for extracting specific drug relationships. For this purpose, we introduce the T-distribution and Chi-square distribution to obtain sentence features that focus on drug entities.
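As a concrete illustration of this preprocessing step, the following is a minimal sketch (not the authors' exact pipeline) of turning a SMILES string into a fingerprint bit vector and an adjacency matrix with RDKit. The Morgan fingerprint with radius 1 is used here as a stand-in for the r-radius subgraph fingerprints described later, and the bit size is an arbitrary choice.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def drug_features(smiles, radius=1, n_bits=1024):
    """Convert a SMILES string into (fingerprint, adjacency matrix)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # some DrugBank entries have no parsable SMILES
    # Morgan (circular) fingerprint; radius matches the r = 1 used later
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    adj = Chem.GetAdjacencyMatrix(mol)  # atoms as nodes, bonds as edges
    return list(fp), adj

# e.g. aspirin
features = drug_features("CC(=O)Oc1ccccc1C(=O)O")
```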
The rest of our work is organized as follows. We first discuss our approach in detail, i.e., the system architecture, then the experimental setup, training, and evaluation metrics. This is followed by a discussion of the experimental results, and we end with a conclusion.

Methods
In this section, we introduce our system architecture (Fig. 2) and explain the modules it invokes in sequence. Our model is divided into four parts. First, we use BioBERT to encode the input sentence and obtain its last hidden state (H_seq). Then, we generate an attention vector according to the positions of DRUG1 and DRUG2 to obtain 'interaction features', which help identify overlapping relations, and we obtain 'entities attention features' that focus on the entities by introducing the Chi-square distribution and T-distribution. Next, we obtain the drug structures from the DrugBank database using the RDKit toolkit, and we use a molecular graph neural network [36] to generate high-quality molecular representations. Finally, we combine all the information to classify the DDI. (Fig. 2: Architecture of the proposed model. BioBERT encodes the input sentences while a molecular graph neural network encodes the drug structures; the Interaction attention vector and Entities attention vector are generated and combined with the BioBERT output; all the obtained information is sent to the classifier for prediction.) We introduce our model and method in detail in the following sections.

Drug mask
The two drug entities in the sentence whose relationship type is to be classified are replaced with 'DRUG1' and 'DRUG2', and all other drug entities in the sentence are replaced with 'DRUGOTHER'.
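A minimal sketch of this masking step on a pre-tokenized sentence; the function and its index arguments are hypothetical helpers, not the authors' code.

```python
def mask_drugs(tokens, idx1, idx2, drug_positions):
    """Replace the candidate pair with DRUG1/DRUG2 and every other
    annotated drug mention with DRUGOTHER."""
    masked = []
    for i, tok in enumerate(tokens):
        if i == idx1:
            masked.append("DRUG1")
        elif i == idx2:
            masked.append("DRUG2")
        elif i in drug_positions:
            masked.append("DRUGOTHER")
        else:
            masked.append(tok)
    return masked

tokens = "alosetron inhibits isoniazid and procainamide".split()
print(mask_drugs(tokens, 0, 2, {0, 2, 4}))
# ['DRUG1', 'inhibits', 'DRUG2', 'and', 'DRUGOTHER']
```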

Over-sampling and under-sampling
The DDI 2013 dataset exhibits a long-tail distribution. Therefore, under-sampling and over-sampling methods are adopted to optimize the training set. Compared with the original data, a large number of repeated negative samples are filtered out by under-sampling, while over-sampling significantly increases the number of sparse samples.
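The paper does not give the exact sampling procedure, so the following is only a hedged sketch of the idea: drop duplicate negative instances (under-sampling) and duplicate instances of sparse positive classes (over-sampling). The duplication factor and rarity threshold are assumptions.

```python
import random
from collections import Counter

def rebalance(instances, neg_label="False", boost=2, rare_frac=0.1, seed=13):
    """instances: list of (sentence, label) pairs."""
    seen, kept = set(), []
    for sent, label in instances:
        if label == neg_label:
            if sent in seen:          # under-sampling: drop repeated negatives
                continue
            seen.add(sent)
        kept.append((sent, label))
    counts = Counter(label for _, label in kept)
    cutoff = rare_frac * max(counts.values())
    rare = {l for l, c in counts.items() if l != neg_label and c < cutoff}
    extra = [(s, l) for s, l in kept if l in rare] * (boost - 1)  # over-sampling
    random.seed(seed)
    result = kept + extra
    random.shuffle(result)
    return result
```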

Sentence encoder
The goal of this component is to obtain a context-aware representation of each token in a sentence. Given the impressive performance of recent deep transformers trained on variants of language modeling, we use the BioBERT model as the sentence encoder. BioBERT was pre-trained on large-scale PubMed abstracts and PubMed Central full-text articles; see Fig. 3 for an overview of the pre-training. (Fig. 3: Overview of the pre-training of BioBERT [31]. BioBERT is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora.) Given a sequence $S = \{x_1, x_2, \ldots, x_n\}$ as input, BioBERT can be formulated as follows:

$$h_i^{0} = W_e x_i + W_b \quad (1)$$
$$h^{l} = \mathrm{transformer\_block}(h^{l-1}), \quad 1 \le l \le L \quad (2)$$
$$h_i = h_i^{L} \quad (3)$$
$$H_{seq} = \{h_1, h_2, \ldots, h_n\} \quad (4)$$

where $x_i$ is the i-th token, L is the total number of layers of BERT, and l ($1 \le l \le L$) is the l-th layer. Equation (1) gives the input embeddings, Eq. (3) the representation of the i-th token, and Eq. (4) the representation of the sequence. The transformer_block in Eq. (2) contains multi-head attention layers, fully connected layers, and the output layer. The parameters $W_e$, $W_b$, and transformer_block are pre-trained on large-scale corpora using two unsupervised pre-training tasks: masked language modeling and next-sentence prediction. The output of the BioBERT model is the context-aware embedding of the tokens, denoted $H_{seq} \in \mathbb{R}^{n \times d}$, where n is the sentence length (including the two special start and end markers [CLS] and [SEP]) and d is the number of hidden units of the BERT model.
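A minimal sketch of obtaining H_seq with the Hugging Face transformers library; the checkpoint name "dmis-lab/biobert-base-cased-v1.1" is the publicly released BioBERT weights and is an assumption about the exact variant used in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "dmis-lab/biobert-base-cased-v1.1"  # public BioBERT checkpoint (assumed variant)
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

sentence = "DRUG1 may increase the serum concentration of DRUG2."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (1, n, d): n tokens incl. [CLS]/[SEP], d hidden units
    H_seq = encoder(**inputs).last_hidden_state
```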

Interaction attention vector
In this part, we introduce the interaction attention vector. As shown in the sentence in Fig. 1, if two drugs interact they tend to be far apart in the sentence, whereas if they do not interact they tend to be closer together, suggesting that the important information characterizing an interaction often lies between the two entities. We found this pattern in the vast majority of biomedical texts, so when dealing with overlapping relationships we should focus on the information between the current pair of drugs. In Table 1, we give statistics of the interaction information between entities in the SemEval-2013 dataset. We computed the statistics for the training, validation, and test sets separately, and the results show that the vast majority of instances fit this pattern; that is, we should pay more attention to the content between entities. Below we explain in detail how to generate the interaction attention vector and how to use it.
We define a high weight range and a lower weight range: the high weight is assigned to the information between the two drugs, and the lower weight to the rest of the sentence. To keep this range of weights elastic, we add an oscillation factor $\sigma$ to the weight range (here we take $\sigma = 0.1$). After adding the oscillation factor, the weights are

$$W_{high} = 1 + \sigma, \qquad W_{low} = \sigma$$

where $W_{high}$ is the weight assigned to the information between the two drugs and $W_{low}$ is the weight assigned to the rest of the sentence. Given a sentence $S = \{x_1, x_2, \ldots, x_n\}$, the final Interaction attention vector is defined as

$$H_{int}[i] = \begin{cases} W_{high}, & j \le i \le k \\ W_{low}, & start \le i < j \ \text{or} \ k < i \le end \end{cases}$$

where start and end denote the start and end of the sentence, j and k denote the positions of the head entity and the tail entity, and $H_{int}$ is the final Interaction attention vector; its visual representation is shown in Fig. 4. The output of the BERT model is the context-aware embedding of the tokens, $H_{seq}$. We then perform a matrix operation between $H_{int}$ and $H_{seq}$, so that, as shown in Fig. 4, each word embedding of $H_{seq}$ is given a different weight:

$$H_{seq} = \mathrm{BioBERT}(input\_sentence)$$
$$H_{it} = H_{int} \otimes H_{seq}$$

where input_sentence is the original input sentence, $H_{it}$ ($H_{it} \in \mathbb{R}^{d \times 1}$) is the synthesis vector obtained after fusing $H_{int}$ and $H_{seq}$, and $\otimes$ denotes matrix multiplication.
We also apply an average operation to the comprehensive vector representation $H_{it}$:

$$H_{interaction\_attention} = W_{int}\left(\frac{1}{k - j + 1}\sum_{i=j}^{k} H_{it}[i]\right) + b_{int}$$

where $H_{interaction\_attention}$ is the output after averaging $H_{it}$ and applying a fully connected layer, and j and k are the positions of the first and second drug. For the hidden-state output $H_{seq}$, we first obtain its pooled output, then add an activation operation $g(\cdot)$ and a fully connected layer, formally expressed as

$$H_{0} = W_{0}\, g\big(\mathrm{pool}(H_{seq})\big) + b_{0}$$

where $W_{int} \in \mathbb{R}^{n \times 1}$ and $W_{0} \in \mathbb{R}^{n \times 1}$ are weight matrices with the same dimensions, and $b_{int}$, $b_{0}$ are the biases of the neural network. (Fig. 4: Visual representation of the Interaction attention vector. The horizontal axis represents each token in the sentence, and the vertical axis the weight assigned to it; the information between the two drug entities is important and receives a high weight, while the rest receives a low weight.)
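To make the two steps above concrete, here is a hedged PyTorch sketch of building H_int and fusing it with H_seq for a single sentence; the exact weight values (1 + σ between the drugs, σ elsewhere) are an assumption consistent with the weight definitions above.

```python
import torch

def interaction_features(H_seq, j, k, sigma=0.1):
    """H_seq: (n, d) token embeddings; j, k: positions of DRUG1 and DRUG2."""
    n, d = H_seq.shape
    H_int = torch.full((n,), sigma)        # low weight for the rest of the sentence
    H_int[j:k + 1] = 1.0 + sigma           # high weight between the two drugs
    H_it = H_int.unsqueeze(1) * H_seq      # reweight every token embedding, (n, d)
    # average the reweighted span between the two drugs (a fully connected
    # layer would follow in the full model)
    return H_it[j:k + 1].mean(dim=0)       # (d,)

H_seq = torch.randn(12, 768)
h = interaction_features(H_seq, j=2, k=9)
```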

Entities attention features
In this study, we introduce the Chi-square probability distribution and the T-distribution to enhance the weights of the target entity and its adjacent words, so that the model can learn the local structure of entities. We refer to these two modified distributions together as the Entities attention vector. The Chi-square probability density function is

$$f(x; k) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}, \quad x > 0,$$

the Chi-square probability distribution function is

$$F(x; k) = \int_{0}^{x} f(u; k)\, \mathrm{d}u,$$

the T probability density function is

$$f(x; n) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}},$$

and the T probability distribution function is

$$F(x; n) = \int_{-\infty}^{x} f(u; n)\, \mathrm{d}u,$$

where x is a real value, $\Gamma$ is the Gamma function, k is the degree of freedom of the Chi-square distribution, n is the degree of freedom of the T-distribution, t is the step size of the Chi-square function, and w is the step size of the T function. We first compute the values of the distributions and then locate the positions of the two entities in the sentence. We map 15% of the tokens before the first entity following a right-to-left position, high-to-low value rule, and 25% of the tokens after the first entity following a left-to-right position, high-to-low value rule, and then apply the symmetric operation to the second entity.
Finally, we obtain the Entities attention vector $H_{ent}$ as shown in Fig. 5, perform a matrix multiplication with the last hidden-state output of BioBERT ($H_{seq}$), and add an activation operation and a fully connected layer:

$$H_{et} = H_{ent} \otimes H_{seq}$$
$$H_{entities\_attention} = W_{ent}\, g(H_{et}) + b_{ent}$$

where $H_{ent}$ is the Entities attention vector, $H_{et}$ ($H_{et} \in \mathbb{R}^{1 \times d}$) is the synthesis vector obtained after fusing $H_{ent}$ and $H_{seq}$, $H_{entities\_attention}$ is the output after the activation operation and the fully connected layer, $W_{ent} \in \mathbb{R}^{n \times 1}$ is the weight matrix, and $b_{ent}$ is the bias of the fully connected layer. (Fig. 5: Visual representation of the Entities attention vector. The horizontal axis represents each token in the sentence, and the vertical axis the weight assigned to it according to the Entities attention vector, a combination of the modified T-distribution and Chi-square distribution.)
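A hedged sketch of building the weight profile around one entity with SciPy's Chi-square and T densities; the 15%/25% fractions follow the text, while the degrees of freedom and the grids the densities are sampled on are assumptions.

```python
import numpy as np
from scipy.stats import chi2, t

def entity_weights(n_tokens, pos, df_chi=4, df_t=5):
    """Weight profile around one entity at position `pos`: 15% of tokens
    before it get T-density values mapped right-to-left, high-to-low;
    25% after it get Chi-square density values mapped left-to-right,
    high-to-low (the second entity would be handled symmetrically)."""
    w = np.zeros(n_tokens)
    n_left = int(round(0.15 * n_tokens))
    n_right = int(round(0.25 * n_tokens))
    # descending density values, highest next to the entity
    t_vals = np.sort(t.pdf(np.linspace(0.0, 3.0, max(n_left, 1)), df_t))[::-1]
    c_vals = np.sort(chi2.pdf(np.linspace(2.0, 10.0, max(n_right, 1)), df_chi))[::-1]
    for step in range(1, n_left + 1):       # right-to-left before the entity
        if pos - step >= 0:
            w[pos - step] = t_vals[step - 1]
    for step in range(1, n_right + 1):      # left-to-right after the entity
        if pos + step < n_tokens:
            w[pos + step] = c_vals[step - 1]
    w[pos] = 1.0                            # the entity itself gets the peak weight
    return w
```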

Molecular structure
DrugBank is a freely available drug database containing more than 10,000 drugs. Using the drug names, we first look up the SMILES formulas of the corresponding drugs in the DrugBank database, and then extract the corresponding drug structures using the extraction method provided by Tsubaki et al. [36]. Figure 6 shows a drug molecular structure obtained in this way. We use r-radius subgraphs, which are induced by the neighboring vertices and edges within radius r of a vertex; r is the number of hops from the current vertex to nearby vertices, and we set the radius to r = 1 here.
We use a molecular graph neural network (MGNN) to encode molecular graph structures. The GNN converts a drug molecule graph G into a fixed-size vector, with atoms represented as nodes and bonds as edges in the graph. We feed the fingerprint vectors into the MGNN, which takes them as the initial vectors and updates them according to the structure of the molecular graph. We define the vector of the i-th atom in a drug molecule as $m_i$ and the set of its neighboring atoms as $N_i$. The vector $m_i$ is updated in the $\ell$-th step as follows:

$$m_i^{(\ell)} = f\Big(m_i^{(\ell-1)} + \sum_{j \in N_i} W^{(\ell)} m_j^{(\ell-1)}\Big)$$

where $f(\cdot)$ denotes a ReLU function. The drug molecular vector is obtained by summing all the atom vectors and feeding the result into a linear layer:

$$H_{mol} = W_{mol}\sum_{i=1}^{M} m_i^{(L)} + b_{mol}$$

where M is the number of fingerprints and $H_{mol}$ is the output of the MGNN. We thus obtain the molecular structure outputs of the two drug entities, $H_{mol1}$ and $H_{mol2}$.
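A hedged PyTorch sketch of this encoder, assuming fingerprint IDs are embedded and propagated over the molecular adjacency matrix; the layer sizes and step count are arbitrary choices, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGNN(nn.Module):
    """Molecular graph encoder in the style of Tsubaki et al.: embed
    r-radius fingerprints, propagate over the adjacency matrix for a
    few steps, sum the atom vectors, and project with a linear layer."""
    def __init__(self, n_fingerprints, dim=64, n_steps=2):
        super().__init__()
        self.embed = nn.Embedding(n_fingerprints, dim)
        self.W = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_steps))
        self.out = nn.Linear(dim, dim)

    def forward(self, fingerprint_ids, adj):
        m = self.embed(fingerprint_ids)       # (n_atoms, dim) initial vectors
        for W in self.W:
            # each atom aggregates its neighbours via the adjacency matrix
            # and passes through a ReLU, matching the update rule above
            m = F.relu(m + adj @ W(m))
        return self.out(m.sum(dim=0))         # H_mol: sum atoms, linear layer

mgnn = MGNN(n_fingerprints=5000)
ids = torch.randint(0, 5000, (9,))            # 9 atoms
adj = torch.eye(9)                            # placeholder adjacency matrix
H_mol = mgnn(ids, adj)                        # fixed-size molecular vector
```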

Softmax layer
In this step, we combine the outputs obtained above into the fused representation H; a fully connected neural network is then employed to learn from H, and finally the softmax function calculates the probability P of each DDI type r:

$$H = [H_{interaction\_attention};\ H_{entities\_attention};\ H_{0};\ H_{mol1};\ H_{mol2}]$$
$$P(r \mid S) = \mathrm{softmax}(W^{*} H + b^{*})$$

where $W^{*}$ and $b^{*}$ are weight and bias parameters. We use the cross-entropy function as the loss function.
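A hedged sketch of the fusion-and-softmax step; simple concatenation is assumed as the way of "combining the output obtained earlier", and the feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class DDIClassifier(nn.Module):
    """Concatenate the text and molecule features, apply one fully
    connected layer, and predict the five DDI types with softmax."""
    def __init__(self, d_text, d_mol, n_types=5):
        super().__init__()
        self.fc = nn.Linear(3 * d_text + 2 * d_mol, n_types)
        self.loss_fn = nn.CrossEntropyLoss()   # cross-entropy loss, as in the paper

    def forward(self, h_int, h_ent, h_0, h_mol1, h_mol2, labels=None):
        H = torch.cat([h_int, h_ent, h_0, h_mol1, h_mol2], dim=-1)
        logits = self.fc(H)
        P = torch.softmax(logits, dim=-1)      # probability of each DDI type r
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return P, loss

clf = DDIClassifier(d_text=768, d_mol=64)
feats = [torch.randn(4, 768)] * 3 + [torch.randn(4, 64)] * 2
P, loss = clf(*feats, labels=torch.tensor([0, 1, 2, 4]))
```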

Dataset and experimental settings
The DDIExtraction 2013 corpus is a manually annotated drug-drug interaction (DDI) corpus based on the DrugBank database and MEDLINE abstracts. The corpus contains four DDI types for evaluation purposes, 'Advise', 'Effect', 'Mechanism', and 'Int', plus a label named 'False' that indicates no interaction. We formulate DDI extraction as a multi-class classification problem. We follow the DDIExtraction 2013 shared task (SemEval-2013 Task 9.2) and use the DDI corpus provided by Asada et al. [15]. Table 2 gives the statistics of the instances in the DDIExtraction 2013 dataset. The task defines the following four interaction labels.
• Mechanism: this type is assigned when a pharmacokinetic mechanism is described in the input sentence.
• Effect: this type is assigned when the effect of the DDI is described.
• Advise: this type is assigned when a recommendation or advice regarding the concomitant use of two drugs is described.
• Int (Interaction): this type is assigned when the sentence simply states that an interaction occurs without providing any detailed information about it.
In the experiments, we employed the PyTorch (https://pytorch.org/) framework to implement our proposed model, and we chose the BioBERT model to encode the input sentences. All DDI extraction methods use the standard evaluation measures (precision, recall, and F-score) as the evaluation metrics, where the F-score is defined as F1 = 2PR/(P + R). Table 3 reports the experimental results in detail. We compared our model with typical models based on CNNs, RNNs, and pre-training. BioBERT is pre-trained on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). The CharacterBERT model uses a character-CNN module to represent whole words from their characters. ChemicalBERT + AGGCN (Parallel) is a model that combines a GCN with ChemicalBERT for the DDI task. In general, LSTM-based models achieve better results than CNN-based models because the LSTM structure can handle long text while CNNs focus more on local features. In addition, pre-training-based models perform better than other methods; for example, the BioBERT model, pre-trained on a large amount of biomedical text, achieves an F1-score of 78.1%. The experimental results show that our method (IMSE) outperforms all the latest models, with an F1-score 7.06% higher than the baseline BioBERT model, and 1.56% and 2.28% higher than the recent DESC_MOL and ChemicalBERT + AGGCN models, respectively. Compared with other pre-training-based methods, the Interaction attention operation lets the model focus more on the information that drives correct classification, i.e., the critical information; the Entities attention operation lets the model focus on the drug entities themselves and ignore other interfering drugs; and the molecular structure, as additional information, provides the model with structural features beyond the text.

Ablation experiments
In this section, to explore the contribution of each component to overall performance, we performed an ablation study over our proposed model. We conducted six comparative experiments, whose results are presented in Table 4. As shown in the table, the F1-score of the BERT (BioBERT) model without any of our additions is 78.1%. When we add only the Entities attention vector, the F1-score increases by 2.79% over BERT, which indicates that the Entities attention information helps determine the relationship between the drugs. Next, we evaluated the impact of our proposed Interaction attention vector: adding it increases the F1-score by 5.66% over BERT, reaching 83.76%. In the fourth experiment we added both the Interaction attention vector and the Entities attention vector, labeled 'BERT + Int* + Ent*'; as shown in the table, its precision, recall, and F1-score reach 85.54%, 83.56%, and 84.47% respectively, demonstrating that the Interaction attention and Entities attention information coordinate well. In the fifth experiment (BERT + Int* + Ent* + MOL), we examined the influence of molecular structure by adding it on top of the fourth experiment; the F1-score increases by a further 0.41%. The experimental results show that this method is effective and achieves the highest score (85.16%) among the existing models compared.

Performance on fivefold cross validation
We used fivefold cross-validation to further explore the stability of the experimental results and the practicability of our method. As can be seen from Table 5, the Interaction attention vector we proposed strongly boosts the baseline model: for each of the four relationship categories we focused on, it contributes to the model's performance. Molecular structure also performs well in most of the results, though its effect is not as strong as that of the Interaction attention vector. Finally, the performance of the model improves greatly once all the useful information is added, and the experimental results show that our method has strong generalization ability and stability.

Error analysis
To ensure fair results, we applied over-sampling and under-sampling only to the training set. Although this is effective, it cannot resolve the long-tail distribution of the test set itself, which is a major characteristic and difficulty of biological data; the data in our test set is very unevenly distributed, and this was an important factor affecting the final results. The Interaction attention vector mainly serves to better extract the information between two entities in a sentence, but some sentences are very short and provide very little information for judging the relationship; the improvement the Interaction attention vector can bring is therefore limited, and some misclassifications are inevitable. In addition, when obtaining drug structures from drug names, we first need to derive the molecular formulas of the drugs from a medical knowledge base, and for some drugs no corresponding molecular formula could be found. On the other hand, when obtaining molecular structures from SMILES, we get a planar structure or incorrect information, while the actual molecular structure of a drug is three-dimensional, so a great deal of information beyond the planar structure is lost. All these errors have a considerable impact when propagated into the model.

Interpretability
The performance of IMSE benefits from several major factors. (1) Interaction information contains the key information of a drug-drug interaction, and the introduction of the Interaction attention vector improves the accuracy of the features. (2) The introduction of entity information fully extracts the local information of entities in the feature space, which effectively reduces the interference of other entities with the model. (3) The addition of molecular structure effectively enriches the feature space and provides information beyond the text.

Conclusion
In this paper, we propose a DDI extraction model based on BioBERT, termed IMSE, to improve the performance of DDI extraction. In our model, we use our proposed Interaction attention vector to enhance the interaction information in sentences and better handle the relationship-overlap problem. The molecular structure information we add exploits knowledge that cannot be learned from text and better characterizes the drug feature space. In addition, we add Entities attention vectors to enhance the weights around the entities, so that the model can better capture the information around the drug entities without being distracted by other drugs. Comparative experiments on benchmark datasets showed that IMSE achieves better predictive performance than existing prediction models, improving DDI identification. The ablation experiments likewise show that each part of the model contributes to the overall performance.